The principal question of the project was to determine the key features and factors of marijuana-related (MJ-related) crimes committed in Denver against industrial versus non-industrial objects or victims. The data were taken from the Denver Police Department and cover the period from 2015 to 2020. The project highlights several geographical units where certain kinds of crimes are more likely to be committed and suggests that MJ-related delinquency in Denver is criminologically closer to property crime and should be combatted accordingly. The most important outcome, however, is that no single aspect of a crime (be it locale, type or time) is sufficient to predict the remaining aspects, at least not the industrial/non-industrial nature of the crime.
Introduction
Questions
For this project, we want to answer the following questions about marijuana-related crimes in Denver, Colorado.
Geographical issues. How much does the location of a district contribute to crime features (such as offense_type and offense_category by neighbourhood)? Most interestingly, we may put the crimes onto the city map and see, for example, whether a nearby airport facilitates growing weed, and whether people grow it in their flats or rather in country houses. With this question we want to investigate whether there is any correlation between the type of crime and its location.
Criminology issues. What is the relation between marijuana and other types of crimes (e.g. are they against property or rather violent)?
Machine learning problems: classify whether a certain crime is of the industry or non-industry type.
With what data?
To carry out the project, we selected a data set from Kaggle called Marijuana related Crime. The dataframe has 14 variables (6 numeric and 8 character) and 1201 observations. The vast majority of columns have only a few NA values.
The crimes included in the dataset are related to marijuana in some manner, either directly or indirectly. For instance, there may be a crime in which marijuana was sold or illegally possessed, as well as a crime in which the offender was merely a marijuana consumer stealing something in the most ordinary way. Crimes in which marijuana itself was the crime target (e.g. it was stolen or extorted, or its cultivation was illegally infringed) are NOT included in the data.
Marijuana is legalized in the state of Colorado, where Denver is located. So when you see a crime labeled as 'Marijuana possession', this means that the amount of marijuana exceeded the legal limits, or its quality did not match the law (the THC content was too high), or the person was under the legal age to possess it.
All the columns are more or less clear in terms of what they mean, except for MJ_RELATION_TYPE, which takes the values 'INDUSTRY' or 'NON-INDUSTRY'. Industry-related crimes involve marijuana and licensed marijuana facilities: these reported crimes are committed against the licensed industry or by the industry itself. Non-industry crimes are reported crimes where marijuana is the primary target in the commission of the crime but has no readily apparent tie to a licensed operation.
At the same time, all the crimes are somehow related to or connected with marijuana. This means no comparison with non-MJ-related crimes can be presented unless extra data is pulled.
Where does the data come from, how was it collected?
Data in this file are crimes reported to the Denver Police Department which, upon review, were determined to have clear connection or relation to marijuana. These data do not include police reports for violations restricting the possession, sale, and/or cultivation of marijuana. This dataset is based upon the National Incident Based Reporting System (NIBRS) which includes all victims of person crimes and all crimes within an incident. The data is dynamic, which allows for additions, deletions and/or modifications at any time, resulting in more accurate information in the database. Due to continuous data entry, the number of records in subsequent extractions are subject to change.
To start our project, we first import the packages and our data set.
Code
import warnings
warnings.filterwarnings('ignore')
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
import statsmodels.api as sm

rawdf = pd.read_csv("crime_marijuana.csv")
rawdf.head()
   INCIDENT_ID FIRST_OCCURENCE_DATE LAST_OCCURENCE_DATE REPORTDATE    INCIDENT_ADDRESS      GEO_X      GEO_Y  DISTRICT_ID  PRECINCT_ID  OFFENSE_CODE
0   2017151765            06-MAR-17           06-MAR-17  06-MAR-17    2207 N HOOKER ST  3132526.0  1698468.0            1          121          3563
1     20184912            03-JAN-18           03-JAN-18  03-JAN-18    4400 E EVANS AVE  3158749.0  1672408.0            3          314          2205
2     20184942            03-JAN-18           03-JAN-18  03-JAN-18  3435 S YOSEMITE ST  3173094.0  1663993.0            3          323          2203
3    201666719            01-FEB-16           01-FEB-16  01-FEB-16      5050 N YORK ST  3152054.0  1712498.0            2          212          2203
4   2016317585            21-MAY-16           21-MAY-16  21-MAY-16   3888 E MEXICO AVE  3157004.0  1674967.0            3          312          2203

                OFFENSE_TYPE_ID OFFENSE_CATEGORY_ID MJ_RELATION_TYPE   NEIGHBORHOOD_ID
0  DRUG - MARIJUANA CULTIVATION       Drug Offenses   NON-INDUSTRY\r        sloan-lake
1  BURGLARY - BUSINESS NO FORCE            Burglary       INDUSTRY\r  university-hills
2  BURGLARY - BUSINESS BY FORCE            Burglary       INDUSTRY\r           hampden
3  BURGLARY - BUSINESS BY FORCE            Burglary       INDUSTRY\r    elyria-swansea
4  BURGLARY - BUSINESS BY FORCE            Burglary       INDUSTRY\r      cory-merrill
Analysis
Cleaning the dataset
To begin the exploratory data analysis (EDA), we check the types of the variables and the missing values. Our data set has 6 numerical and 8 string variables; 3 of the string variables should be converted to the date type. We have one missing value in 'GEO_X', one in 'GEO_Y' and 338 in LAST_OCCURENCE_DATE, so we need to decide how to handle the latter. According to the data set documentation, the LAST_OCCURENCE_DATE is essentially the date the crime happened, i.e. it should normally match the FIRST_OCCURENCE_DATE. Below we verify this and conclude that the missing values can be replaced with FIRST_OCCURENCE_DATE.
After filling the NaN values of LAST_OCCURENCE_DATE, we drop the rows where we have missing value in ‘GEO_X’ and ‘GEO_Y’.
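These two cleaning steps can be sketched as follows. This is a minimal example on toy rows (column names as in the dataset, values illustrative), not the real file:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the relevant columns (values are illustrative)
df = pd.DataFrame({
    'FIRST_OCCURENCE_DATE': ['06-MAR-17', '03-JAN-18', '01-FEB-16'],
    'LAST_OCCURENCE_DATE':  ['06-MAR-17', None, '01-FEB-16'],
    'GEO_X': [3132526.0, 3158749.0, np.nan],
    'GEO_Y': [1698468.0, 1672408.0, 1712498.0],
})

# Fill the missing LAST_OCCURENCE_DATE with FIRST_OCCURENCE_DATE ...
df['LAST_OCCURENCE_DATE'] = df['LAST_OCCURENCE_DATE'].fillna(df['FIRST_OCCURENCE_DATE'])
# ... then drop the rows that lack coordinates
df = df.dropna(subset=['GEO_X', 'GEO_Y']).reset_index(drop=True)
```

On the real frame this leaves 1200 of the 1201 observations, as seen in the describe() output further below.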
But why should we believe the documentation? Let us check. Apart from the NaN values, the LAST_OCCURENCE_DATE fails to match the FIRST_OCCURENCE_DATE in only 127 cases, broken down by offence type below:
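The check itself amounts to a couple of boolean filters; a sketch on toy rows (on the real frame it yields the 127 and 30 counts shown below):

```python
import pandas as pd

# Toy rows with the three date columns (values are illustrative)
df = pd.DataFrame({
    'FIRST_OCCURENCE_DATE': ['06-MAR-17', '03-JAN-18', '01-FEB-16', '21-MAY-16'],
    'LAST_OCCURENCE_DATE':  ['06-MAR-17', '05-JAN-18', None, '21-MAY-16'],
    'REPORTDATE':           ['06-MAR-17', '03-JAN-18', '01-FEB-16', '21-MAY-16'],
    'OFFENSE_TYPE_ID': ['DRUG - MARIJUANA CULTIVATION', 'BURGLARY - BUSINESS NO FORCE',
                        'BURGLARY - BUSINESS BY FORCE', 'BURGLARY - BUSINESS BY FORCE'],
})

# Non-null rows where the two occurrence dates disagree
mismatch = df[df['LAST_OCCURENCE_DATE'].notna()
              & (df['LAST_OCCURENCE_DATE'] != df['FIRST_OCCURENCE_DATE'])]
mismatch['OFFENSE_TYPE_ID'].value_counts()

# Rows where LAST_OCCURENCE_DATE also differs from the report date
triple = mismatch[mismatch['LAST_OCCURENCE_DATE'] != mismatch['REPORTDATE']]
```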
127
BURGLARY - BUSINESS BY FORCE 82
THEFT - OTHER 54
CRIMINAL MISCHIEF - OTHER 42
ROBBERY - STREET 36
DRUG - MARIJUANA SELL 24
THEFT - SHOPLIFT 21
ASSAULT - SIMPLE 18
CRIMINAL TRESPASSING 17
PUBLIC ORDER CRIMES - OTHER 17
AGGRAVATED ASSAULT 14
ROBBERY - BUSINESS 13
THREATS TO INJURE 13
THEFT - FROM BLDG 12
THEFT - ITEMS FROM VEHICLE 11
ROBBERY - RESIDENCE 9
DRUG - MARIJUANA CULTIVATION 7
BURGLARY - RESIDENCE BY FORCE 6
BURGLARY - RESIDENCE NO FORCE 5
BURGLARY - BUSINESS NO FORCE 5
MENACING - FELONY W/WEAP 5
DISTURBING THE PEACE 4
THEFT - OF MOTOR VEHICLE 4
CRIMINAL MISCHIEF - GRAFFITI 3
DRUG - MARIJUANA POSSESS 3
ROBBERY - CAR JACKING 2
DRUG - COCAINE POSSESS 2
DRUG - PCS - OTHER DRUG 2
ROBBERY - PURSE SNATCH W/FORCE 2
FRAUD - CRIMINAL IMPERSONATION 2
THEFT - UNAUTH USE OF FTD 2
BURGLARY - POSS. OF TOOLS 2
FORGERY - OTHER 2
WEAPON-BY PREV OFFENDER-POWPO 1
FORGERY - POSSES FORGE DEVICE 1
ARSON - BUSINESS 1
OTHER ENVIRONMENT/ANIMAL VIOL 1
ASSAULT - DV 1
THEFT - PARTS FROM VEHICLE 1
FORGERY - POSS. OF FORGED FTD 1
ARSON - RESIDENCE 1
POLICE - RESISTING ARREST 1
BURGLARY - VENDING MACHINE 1
WEAPON - FIRE INTO OCC BLDG 1
DRUG - METHAMPETAMINE POSSESS 1
POLICE - FALSE INFORMATION 1
THEFT - EMBEZZLE 1
THEFT - PURSE SNATCH NO FORCE 1
KIDNAP - ADULT VICTIM 1
EXPLOSIVE/INCENDIARY DEV - POS 1
LIQUOR - POSSESSION 1
FORGERY - POSS OF FORGED INST 1
WEAPON- UNLAWFUL DISCHARGE OF 1
CRIMINAL MISCHIEF - MTR VEH 1
WEAPON-POSS ILLEGAL/DANGEROUS 1
THEFT - PICK POCKET 1
DRUG - HEROIN POSSESS 1
Name: OFFENSE_TYPE_ID, dtype: int64
INCIDENT_ID 30
FIRST_OCCURENCE_DATE 30
LAST_OCCURENCE_DATE 30
REPORTDATE 30
INCIDENT_ADDRESS 30
GEO_X 30
GEO_Y 30
DISTRICT_ID 30
PRECINCT_ID 30
OFFENSE_CODE 30
OFFENSE_TYPE_ID 30
OFFENSE_CATEGORY_ID 30
MJ_RELATION_TYPE 30
NEIGHBORHOOD_ID 30
dtype: int64
So we see that the LAST_OCCURENCE_DATE fails to match the FIRST_OCCURENCE_DATE not only when it is missing, but in 127 further cases. What might be the nature of these mismatches? Judging by domain knowledge, the crime types where the two dates do not match are mostly non-continuous by nature (e.g. burglaries, robberies, thefts). Moreover, in only 30 cases does the LAST_OCCURENCE_DATE differ from both the FIRST_OCCURENCE_DATE and the REPORT_DATE at the same time. This means that the non-matching values are more likely to be simple errors than the result of an actual lack of information. At least the remainder, where the mismatch may indeed be informative, is small and cannot be extrapolated from. Ergo, we can replace the missing values with FIRST_OCCURENCE_DATE.
As mentioned earlier, we have 3 date variables, but their type is string. In order to use time series, we need to convert them to the date type.
After changing the types, we added 5 new variables to our data set. Then we visualised the yearly trend of the number of crimes. For 2020, we only have 19 observations, all from January.
Code
# The dates are stored as 'DD-MON-YY' strings; rebuild them and parse to datetime
FOD_series = pd.Series(
    data=[item.split("-")[0] + "-" + item.split("-")[1] + "-" + item.split("-")[2]
          for item in df['FIRST_OCCURENCE_DATE']],
    index=df.index)
RD_series = pd.Series(
    data=[item.split("-")[0] + "-" + item.split("-")[1] + "-" + item.split("-")[2]
          for item in df['REPORTDATE']],
    index=df.index)
LOD_series = pd.Series(
    data=[item.split("-")[0] + "-" + item.split("-")[1] + "-" + item.split("-")[2]
          for item in df['LAST_OCCURENCE_DATE']],
    index=df.index)
FOD_series = pd.to_datetime(FOD_series)
LOD_series = pd.to_datetime(LOD_series)
RD_series = pd.to_datetime(RD_series)

df.drop(['FIRST_OCCURENCE_DATE', 'LAST_OCCURENCE_DATE', 'REPORTDATE'], axis=1, inplace=True)
df.insert(loc=0, column='first_occurence_date', value=FOD_series)
df.insert(loc=1, column='last_occurence_date', value=LOD_series)
df.insert(loc=2, column='reported_date', value=RD_series)

year_series = FOD_series.dt.year            # year values
month_series = FOD_series.dt.month          # month values
day_series = FOD_series.dt.day              # day values as integers
day_name_series = FOD_series.dt.day_name()  # Monday, Tuesday, Wednesday, etc.

# Add the 'Year', 'Month', 'Day' and 'Day Name' columns to the DataFrame
df['Year'] = year_series
df['Month'] = month_series
df['Day'] = day_series
df['Day Name'] = day_name_series

# Duration between first and last occurrence, in days
duration = (df['last_occurence_date'] - df['first_occurence_date'])
duration = duration.apply(lambda x: x.days)
df.insert(loc=4, column='duration', value=duration)
Visualisation on the map
A problem with this dataset is that it has no latitude/longitude coordinates. Instead we have SPC (state plane coordinates). In the following, we transform the data and map the locations of the crimes on a map of Denver. Then we create a new variable named dist, the distance of the place of occurrence from the city center.
Code
from pyproj import Proj, transform
from pyproj import CRS, Transformer

# EPSG:2232 (Colorado Central state plane) -> EPSG:4326 (WGS84 lat/long)
transformer = Transformer.from_crs(2232, 4326)
df['lat'] = df.apply(lambda x: transformer.transform(x.GEO_X, x.GEO_Y)[0], axis=1)
df['long'] = df.apply(lambda x: transformer.transform(x.GEO_X, x.GEO_Y)[1], axis=1)
fig1. This shows how the crimes are located relative to each other. More informative maps will follow.
fig2. We use the bounds from BBox to export a picture of the streets of Denver from OpenStreetMap. It is a fast and easy way to get an overview. Then we import the picture and plot the data points on top of it.
Based on the above map, it is evident that the majority of crimes occur in the middle of the city, while some areas such as the south-east or north-west have fewer crimes than other parts.
In addition, we define a new variable called 'dist', which holds the distance of the crime from the city center:
Code
from math import radians, cos, sin, asin, acos, sqrt, atan2

centerlat = 39.7392
centerlong = -104.9903

def calculate_spherical_distance(lat, long):
    # Convert degrees to radians
    coordinates = lat, long, centerlat, centerlong
    phi1, lambda1, phi2, lambda2 = [radians(c) for c in coordinates]
    # Apply the haversine formula (6371 km = Earth radius)
    a = sin((phi2 - phi1) / 2)**2 + cos(phi1) * cos(phi2) * sin((lambda2 - lambda1) / 2)**2
    d = 2 * 6371 * atan2(sqrt(a), sqrt(1 - a))
    return d

df['dist'] = df.apply(lambda x: calculate_spherical_distance(x['lat'], x['long']), axis=1)
df['dist'].describe()
count 1200.000000
mean 5.811882
std 3.471711
min 0.252880
25% 3.399797
50% 5.299129
75% 7.186625
max 26.514020
Name: dist, dtype: float64
Adding an additional label column
In forensic sciences, a major distinction is made between violent and non-violent crimes. Let us label the offences as violent or not.
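The labelling mechanism can be sketched as below. Note that the actual assignment of offence categories to the violent class is the project's own and is only assumed here for illustration:

```python
import pandas as pd

# Hypothetical category split; the real mapping used in the project is not
# reproduced here, only the labelling mechanism
violent_categories = {'Burglary', 'Robbery-Street-Res', 'Agg ASLT-Other'}

df = pd.DataFrame({'OFFENSE_CATEGORY_ID':
                   ['Burglary', 'Larceny', 'Robbery-Street-Res', 'Drug Offenses']})
df['VIOLENCE_RELATION'] = df['OFFENSE_CATEGORY_ID'].map(
    lambda c: 'violent' if c in violent_categories else 'non-violent')
```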
fig4. Similar to the previous graph, there is no apparent tendency here. However, we can see that for GEO-X above 3.19 and GEO-Y above 1.72 (in millions), industry crimes did not occur.
Finally, let us observe violent and non-violent crimes on the map.
fig5. From this map, it is evident that the majority of violent and non-violent crimes were committed between 3.14 and 3.16 in GEO-X (in millions), but they are distributed over a wider range of GEO-Y (1.67 to 1.71).
Visualization of crime categories
To begin with, let us observe the distribution of offence categories.
Code
ax = sns.countplot(y=df['OFFENSE_CATEGORY_ID'],
                   order=df['OFFENSE_CATEGORY_ID'].value_counts().index,
                   orient='h')
plt.show()
fig6. At first glance, we can see that most of the crimes are burglary offences (almost 700), while Larceny, Robbery-Street-Res and Criminal Mischief-Property account for about 150, 100 and 90 respectively.
Apart from offence categories, we also have offence types, which are more specific and consequently more numerous. We will not plot them because there are too many, and the vast majority of them have very few observations.
fig8. We can see that the number of violent crimes is twice the number of non-violent crimes.
Trend of MJ Crime over time
Another important topic to consider is the trend of the main variables over time. To do so, we first define some time-related variables and then visualise the yearly trend of crimes.
Code
Yearlabels = ['2015', '2016', '2017', '2018', '2019']
df['Year'].unique()
df_2015 = df.loc[df['Year'] == 2015]
df_2016 = df.loc[df['Year'] == 2016]
df_2017 = df.loc[df['Year'] == 2017]
df_2018 = df.loc[df['Year'] == 2018]
df_2019 = df.loc[df['Year'] == 2019]

plt.figure(figsize=(15, 5))
plt.title('Yearly Trend of number of Crimes per Month')
plt.plot(df_2015['Month'].sort_values().unique().reshape(-1, 1),
         df_2015.groupby('Month', as_index=False).count()['OFFENSE_CATEGORY_ID'], 'g-o', label='2015')
plt.plot(df_2016['Month'].sort_values().unique().reshape(-1, 1),
         df_2016.groupby('Month', as_index=False).count()['OFFENSE_CATEGORY_ID'], 'b-s', label='2016')
plt.plot(df_2017['Month'].sort_values().unique().reshape(-1, 1),
         df_2017.groupby('Month', as_index=False).count()['OFFENSE_CATEGORY_ID'], 'k-^', label='2017')
plt.plot(df_2018['Month'].sort_values().unique().reshape(-1, 1),
         df_2018.groupby('Month', as_index=False).count()['OFFENSE_CATEGORY_ID'], 'r--+', label='2018')
plt.plot(df_2019['Month'].sort_values().unique().reshape(-1, 1),
         df_2019.groupby('Month', as_index=False).count()['OFFENSE_CATEGORY_ID'], 'y-o', label='2019')
plt.xlabel('Month')
plt.ylabel('Number of Crimes')
plt.legend(bbox_to_anchor=(1, 1), loc='upper right')
plt.grid(True)
plt.show()
fig9. At first glance, it is evident that the within-year fluctuations differ between years. Still, in some months a general trend in the number of crimes can be observed. For instance, in August there is a sudden increase in almost all years. In addition, in all years except 2018, a dramatic drop in the number of crimes can be seen from September to October or November to December.
Trend of violent and non-violent crime over time
In 2020 we only have data for January, which is why the number of crimes is lower than in other years. The plot shows that the number of violent crimes in each year exceeds the number of non-violent ones.
fig10. The stacked barplot above shows that the number of violent crimes is higher than the number of non-violent crimes, which we already knew. The proportion of violent to non-violent crimes also seems roughly constant across years.
fig11. The trend of violent and non-violent crimes over the months shows that from June to September their numbers increased steadily, while in other months the trend fluctuated. The proportion does differ between months.
fig12. The number of violent crimes is higher between Tuesday and Thursday than on other days. At the same time, the proportions seem to differ between working and non-working days.
Trend of industry and non-industry crimes over time
Code
plt.figure(figsize=(5, 5))
valuetable = pd.crosstab(df[df.Year != 2020]['Year'], df['MJ_RELATION_TYPE'])
valuetable.plot.bar(stacked=True, color=['#ff7f0e', '#1f77b4'])
plt.title('Yearly Trend')
plt.xticks(rotation='horizontal')
plt.xlabel('Year')
plt.ylabel('count')
order = [0, 1]
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(handles=[handles[i - 1] for i in order], bbox_to_anchor=(1, 1),
           loc='upper right', labels=['Non-Industry', 'Industry'])
plt.show()
fig13. We can observe that the number of non-industry crimes decreased gradually, and their proportion was decreasing throughout the whole period.
Code
Monthlabels = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December']
Monthtickorder = list(list(df['Month'].unique())[i] - 1 for i in range(0, 12))
Monthtickorder.sort()

plt.figure(figsize=(5, 5))
valuetable = pd.crosstab(df[df.Year != 2020]['Month'], df.MJ_RELATION_TYPE)
valuetable.plot.bar(stacked=True, color=['#ff7f0e', '#1f77b4'])
plt.title('Monthly Trend')
plt.xticks(ticks=Monthtickorder, labels=Monthlabels)
plt.xlabel('Month')
plt.ylabel('count')
plt.legend(handles=[handles[i - 1] for i in order], bbox_to_anchor=(1, 1),
           loc='upper left', labels=['Non-Industry', 'Industry'])
plt.show()
fig14. As with violent and non-violent crimes, the proportion differs greatly across months. Compare, for instance, October and December.
Code
valuetable = pd.crosstab(df['Day Name'], df['MJ_RELATION_TYPE']).reset_index()
valuetable.set_index('Day Name', inplace=True)
valuetable = valuetable.reindex(Daylabels).reset_index()
valuetable.plot.bar(stacked=True, color=['#ff7f0e', '#1f77b4'])
plt.title('Daily Trend')
plt.xlabel('Day of Week')
plt.ylabel('count')
plt.legend(handles=[handles[i - 1] for i in order], bbox_to_anchor=(1.01, 1),
           loc='upper right', labels=['Non-Industry', 'Industry'])
plt.xticks(ticks=range(0, 7), labels=Daylabels, rotation=0, fontsize=9)
plt.show()
fig15. The split between the weekend and the rest of the week is not as obvious as it was in the corresponding violent/non-violent bar chart.
Plots concerning offence categories
We will be using categories (not types), as there are far fewer categories than types, so conclusions may be more general. For the next plot, note that only the four principal crime categories (burglary, larceny, street robbery and property mischief) are shown. The remaining crimes have few observations, and in that case a plotted trend is not very informative.
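Restricting a plot to the most frequent categories amounts to a value_counts-based filter, sketched here on toy data:

```python
import pandas as pd

# Toy category column; the real counts come from the crime frame
cats = pd.Series(['Burglary'] * 5 + ['Larceny'] * 3 + ['Robbery-Street-Res'] * 2
                 + ['Criminal Mischief-Property'] * 2 + ['Arson'])

# Keep only the four most frequent categories
top4 = cats.value_counts().nlargest(4).index
subset = cats[cats.isin(top4)]
```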
fig16. In 2016 Denver experienced a gradual rise in the number of burglaries, whereas over the same period the numbers of the other crimes decreased or remained stable. In addition, the number of all kinds of crimes (except Robbery-Street-Res) rose again by 2018, after which a decreasing trend can be observed in all years.
fig17. Note that the bars are intentionally not stacked, for the sake of clarity: the left-most section represents the least criminal month. We may notice that several months (April, May, December, July) lead for different crimes. This displays the variety of crime types in the dataset.
For plotting over weekdays we again take the four most frequent crimes; otherwise the plot is hard to grasp.
fig18. The majority of burglaries occurred on Tuesday and Thursday, whereas larcenies happened most often on Thursday, and for Robbery-Street-Res and criminal mischief Sunday has the most crimes. As with months, there is no unique leader: different weekdays win for different crime types.
Now let us turn the picture around and plot the proportions of crime types over time units. The graph over the years was tried and appeared too uneventful, so we go straight to months.
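Plotting proportions rather than counts amounts to a row-normalised crosstab; a toy sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Month': [1, 1, 1, 2, 2, 2],
    'OFFENSE_CATEGORY_ID': ['Burglary', 'Burglary', 'Larceny',
                            'Burglary', 'Larceny', 'Larceny'],
})
# normalize='index' makes each row sum to 1, giving per-month shares
shares = pd.crosstab(df['Month'], df['OFFENSE_CATEGORY_ID'], normalize='index')
```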
fig19. Recall that all years except 2020 are considered. Based on the plot, the proportion of burglaries increased in August, June stood out for street robberies, and property mischief was more frequent in September.
fig20. Similar to fig7, this chart shows the trend of the most numerous crimes over the weekdays. The main surprise here is Sunday's very distinctive proportion, unlike that of any other weekday.
Relation between MJ relation type and distance
We finalize our visualization part with some extra plots concerning our target variable, MJ relation type. In particular, we want to visualize its possible links to the distance from the center (the 'dist' variable).
Firstly, we plot the ‘dist’ variable itself to know its structure better.
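The quantities a boxplot displays can also be computed directly; a sketch with synthetic stand-in data (the real values come from df['dist']):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dist = pd.Series(rng.gamma(shape=3.0, scale=2.0, size=500))  # stand-in for df['dist']

q1, med, q3 = dist.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
# Tukey's rule: points beyond 1.5 * IQR from the quartiles are drawn as outliers
outliers = dist[(dist < q1 - 1.5 * iqr) | (dist > q3 + 1.5 * iqr)]
```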
fig21. A boxplot shows various measures of position of a distribution and allows comparison between distributions. Here it shows that the median of 'dist' is around 5 km from the city center and that the outliers lie more than 5 times higher than the median. Moreover, we notice a narrow IQR, which means that the majority of crimes are committed between 3.5 and 6.5 km from the city center. The histogram confirms this conclusion.
Secondly, we plot 'dist' against the target variable:
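The comparison behind this plot reduces to per-group summary statistics; a toy sketch with assumed values:

```python
import pandas as pd

# Illustrative distances in km, not real records
df = pd.DataFrame({
    'MJ_RELATION_TYPE': ['INDUSTRY', 'INDUSTRY', 'NON-INDUSTRY', 'NON-INDUSTRY', 'NON-INDUSTRY'],
    'dist': [4.8, 5.2, 3.0, 5.0, 18.5],
})
# Median and spread of distance per relation type
stats = df.groupby('MJ_RELATION_TYPE')['dist'].agg(['median', 'std'])
```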
fig22. Based on this chart, the variability of distance from the city center is greater for non-industry crimes than for industry crimes, although both have about the same median (around 5 km). There are more outliers among non-industry crimes, lying roughly 17 to 27 km from the city center. The IQR of industry crimes is narrower, so we conclude that their locations are less spread out than those of non-industry crimes.
Analysis of correlation
Correlation matrices serve two purposes here. First, they may reveal interesting coincidences between variables (which, however, does not imply any causation); second, they help hunt for collinearity that may affect the ML models. However, after dummy encoding the dataframe becomes very wide, so the correlation matrix is very large. Instead of visualizing it, we extract the most correlated and anti-correlated variable pairs as a dataframe.
Code
corrdf = df[['MJ_RELATION_TYPE', 'lat', 'long', 'DISTRICT_ID', 'OFFENSE_CATEGORY_ID',
             'NEIGHBORHOOD_ID', 'Month', 'Day Name', 'VIOLENCE_RELATION', 'dist', 'duration']]
corrdf = pd.get_dummies(data=corrdf, drop_first=True,
                        columns=['DISTRICT_ID', 'OFFENSE_CATEGORY_ID', 'NEIGHBORHOOD_ID',
                                 'Month', 'Day Name', 'VIOLENCE_RELATION', 'MJ_RELATION_TYPE'])

def get_top_abs_correlations(df, n=10, geography_and_offense_only=False):
    '''Tune n to change the number of top correlations'''
    au_corr = df.corr(method='spearman').unstack()
    au_corr = au_corr.sort_values(ascending=False)
    au_corr = au_corr.to_frame().reset_index()
    au_corr = au_corr[au_corr[0] != 1]
    au_corr.drop_duplicates(subset=0, inplace=True)
    au_corr.rename(columns={'level_0': 'Value1', 'level_1': 'Value2', 0: 'Correlation_coef'},
                   inplace=True)
    # filter insignificant correlations
    au_corr = au_corr[abs(au_corr.Correlation_coef) > 0.1]
    au_corrng = au_corr
    # keep pairs composed of geography and offence-category columns
    geo_offense = ('NEIGHBORHOOD_ID', 'DISTRICT_ID', 'OFFENSE_CATEGORY_ID', 'lat', 'long', 'dist')
    au_corr = au_corr[au_corr.Value1.str.startswith(geo_offense)]
    au_corr = au_corr[au_corr.Value2.str.startswith(geo_offense)]
    # part where the first column is geographical and the second is offense category
    au_corr1 = au_corr[au_corr.Value1.str.startswith(('NEIGHBORHOOD_ID', 'DISTRICT_ID', 'lat', 'long', 'dist'))]
    au_corr1 = au_corr1[au_corr1.Value2.str.startswith('OFFENSE_CATEGORY_ID')]
    # part where the first column is offense category and the second is geographical
    au_corr2 = au_corr[au_corr.Value1.str.startswith('OFFENSE_CATEGORY_ID')]
    au_corr2 = au_corr2[au_corr2.Value2.str.startswith(('NEIGHBORHOOD_ID', 'DISTRICT_ID', 'lat', 'long', 'dist'))]
    # concatenate the two parts
    au_corr = pd.concat([au_corr1, au_corr2])
    au_corr = au_corr.sort_values(by='Correlation_coef', ascending=False)
    if not geography_and_offense_only:
        geo_time = ('NEIGHBORHOOD_ID', 'DISTRICT_ID', 'lat', 'long', 'dist', 'Month', 'Day Name')
        au_corr = au_corrng[au_corrng.Value1.str.startswith(geo_time)]
        au_corr = au_corr[au_corr.Value2.str.startswith(geo_time)]
        print(au_corr)
        # anti-join: keep pairs where at least one side is not geographical/temporal
        au_corr = au_corrng.merge(au_corr, on=['Value1', 'Value2', 'Correlation_coef'],
                                  how='left', indicator=True)
        au_corr = au_corr[au_corr._merge == 'left_only'].drop(axis=1, labels='_merge')
        au_corr = pd.concat([au_corr.iloc[0:n, :],
                             au_corr.iloc[len(au_corr) - 1 - n:len(au_corr) - 1, :]], axis=0)
    return au_corr

get_top_abs_correlations(corrdf)
Value1 Value2 \
117 DISTRICT_ID_5 NEIGHBORHOOD_ID_montbello
121 NEIGHBORHOOD_ID_elyria-swansea DISTRICT_ID_2
123 lat DISTRICT_ID_2
125 DISTRICT_ID_7 NEIGHBORHOOD_ID_dia
127 long dist
... ... ...
13673 lat NEIGHBORHOOD_ID_overland
13679 lat DISTRICT_ID_4
13681 DISTRICT_ID_6 dist
13685 DISTRICT_ID_4 long
13687 lat DISTRICT_ID_3
Correlation_coef
117 0.812210
121 0.547706
123 0.511313
125 0.499374
127 0.467455
... ...
13673 -0.328414
13679 -0.429106
13681 -0.463937
13685 -0.562981
13687 -0.595252
[199 rows x 3 columns]
     Value1                                          Value2                                          Correlation_coef
1    VIOLENCE_RELATION_violent                       OFFENSE_CATEGORY_ID_Burglary                            0.625153
8    OFFENSE_CATEGORY_ID_Robbery-Street-Res          MJ_RELATION_TYPE_NON-INDUSTRY\r                         0.442271
30   NEIGHBORHOOD_ID_hale                            OFFENSE_CATEGORY_ID_Robbery-Business                    0.274946
45   DISTRICT_ID_6                                   MJ_RELATION_TYPE_NON-INDUSTRY\r                         0.198805
47   DISTRICT_ID_6                                   OFFENSE_CATEGORY_ID_Agg ASLT-Other                      0.191872
48   DISTRICT_ID_7                                   OFFENSE_CATEGORY_ID_Theft from Motor Vehicle            0.190352
49   OFFENSE_CATEGORY_ID_Robbery-Street-Res          VIOLENCE_RELATION_violent                               0.190183
63   MJ_RELATION_TYPE_NON-INDUSTRY\r                 NEIGHBORHOOD_ID_sloan-lake                              0.161758
71   MJ_RELATION_TYPE_NON-INDUSTRY\r                 NEIGHBORHOOD_ID_gateway-green-valley-ranch              0.154407
72   OFFENSE_CATEGORY_ID_Criminal Mischief-Graffiti  NEIGHBORHOOD_ID_speer                                   0.153741
241  DISTRICT_ID_6                                   OFFENSE_CATEGORY_ID_Burglary                           -0.213144
242  OFFENSE_CATEGORY_ID_Drug Offenses               OFFENSE_CATEGORY_ID_Burglary                           -0.217856
245  OFFENSE_CATEGORY_ID_All Other Crimes            VIOLENCE_RELATION_violent                              -0.231180
249  VIOLENCE_RELATION_violent                       OFFENSE_CATEGORY_ID_Drug Offenses                      -0.269096
251  OFFENSE_CATEGORY_ID_Burglary                    OFFENSE_CATEGORY_ID_Criminal Mischief-Property         -0.272626
254  OFFENSE_CATEGORY_ID_Burglary                    MJ_RELATION_TYPE_NON-INDUSTRY\r                        -0.290677
255  OFFENSE_CATEGORY_ID_Burglary                    OFFENSE_CATEGORY_ID_Robbery-Street-Res                 -0.299994
256  OFFENSE_CATEGORY_ID_Burglary                    OFFENSE_CATEGORY_ID_All Other Crimes                   -0.310034
259  OFFENSE_CATEGORY_ID_Criminal Mischief-Property  VIOLENCE_RELATION_violent                              -0.336748
260  OFFENSE_CATEGORY_ID_Larceny                     OFFENSE_CATEGORY_ID_Burglary                           -0.397286
We may highlight that the table shows evident correlations arising from the crime classification itself: e.g. a strong positive correlation between being violent and being a burglary, and the opposite for larceny; likewise between being a street robbery and being a non-industrial crime, with the opposite holding for burglary. This is especially true for the largest negative correlations. We also see that the dataset does not contain many highly correlated pairs (i.e. with absolute value over 0.5): there are some, but not many for over 100 variables.
With regard to question 1, we also need to look at correlations between offence types and geographical variables. Let us do this by filtering the correlation dataframe.
Some neighbourhoods correlate positively with some crimes: business robberies with the Hale neighborhood, aggravated assault with the 6th district, theft from motor vehicles with the 7th district, and graffiti offences with the Speer neighbourhood.
The only outstanding negative correlation is between burglary and the 6th district. With regard to the 6th and 7th districts, we may conclude that their correlations with other variables are mutually connected: e.g. district 6 correlates strongly with aggravated assault; since this is a non-industrial crime, the district also correlates strongly with MJ_RELATION_TYPE_NON-INDUSTRY. The absolute values of the remaining correlations are below 0.15.
As a result, we have a dataset with few collinearity issues.
Machine learning focused on geographical predictors
Given the number of predictors, let us first try a random forest classifier. Moreover, this type of model is not sensitive to unscaled data.
We will first attempt to use the maximum number of predictors (dropping, however, OFFENSE_TYPE_ID in any event). Then we obtain a feature importance ranking and performance results, aiming to minimize the number of predictors and find optimal settings of the classifier.
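A hedged sketch of this step on synthetic data (the real feature matrix is the dummy-encoded crime frame, not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dummy-encoded crime predictors
X, y = make_classification(n_samples=600, n_features=20, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
score = f1_score(y_test, rf.predict(X_test))

# Importance ranking, used afterwards to prune weak predictors
ranking = np.argsort(rf.feature_importances_)[::-1]
```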
fig23. The model performs reasonably well, and geographical variables are again among the most important ones. Since we focus on geo predictors, we keep a non-geo predictor only if its importance ranking is high; otherwise we assume it is not significant and drop it.
fig25. The performance is slightly worse again, but still acceptable. This time the top-3 predictors are geographical, and the model remains effective.
At this point we consider the predictor subsetting to have reached a tolerable mark, and it is time to tune the parameters of the model.
The metric we consider for this purpose is the test f1-score, because it combines precision and recall. We do not take the train f1-score into account because, throughout the modelling, the training f1-score was close to perfect. We visualize the parameters against the f1-score first and then use the best-fitting parameters to obtain the final model.
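The tuning loop can be sketched like this, on synthetic data; the project tunes the number of trees, the depth and the features per tree, of which only depth is varied here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=15, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Test f1-score for each candidate value; trees/features are tuned analogously
results = {}
for depth in [2, 4, 8, None]:
    model = RandomForestClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    results[depth] = f1_score(y_test, model.predict(X_test))
best_depth = max(results, key=results.get)
```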
fig26-28. A random forest classifier has many tunable parameters; we choose the number of trees, the tree depth and the number of features included in each tree. A common feature of all the plots is the absence of a uniform trend, which is typical for random forest algorithms.
fig29. The model has not changed its focus on geo predictors, and the rest of the predictors have not risen much in the ranking. The f1-score is now over 0.91, which is fairly high.
Due to the limited interpretability of random forest models and our desire for an interpretable result, we now turn to logistic regression. The procedure is as follows: first, we plot a ROC curve and compute the best classification threshold by maximizing the mean of the train and test f1-scores (in the random forest case the training performance was almost perfect throughout, so there was no point in taking the mean). We then use this threshold to improve the model. All of this is done alongside predictor subsetting. Note that the model is assessed with a confusion matrix for both training and testing; the number of observations in the training confusion matrix should not be surprising, since oversampling was performed.
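The threshold-selection step could be sketched as follows, assuming a simple grid scan over candidate cut-offs; the data and the grid are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def best_threshold(model, X_tr, y_tr, X_te, y_te, grid=None):
    """Pick the cut-off maximising the mean of train and test f1-scores."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    p_tr = model.predict_proba(X_tr)[:, 1]
    p_te = model.predict_proba(X_te)[:, 1]
    def mean_f1(t):
        return (f1_score(y_tr, p_tr >= t) + f1_score(y_te, p_te >= t)) / 2
    return max(grid, key=mean_f1)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
model = LogisticRegression().fit(X_tr, y_tr)
threshold = best_threshold(model, X_tr, y_tr, X_te, y_te)
```

The chosen threshold then replaces the default 0.5 cut-off when converting predicted probabilities into class labels.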
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.550117
Iterations: 35
Function evaluations: 40
Gradient evaluations: 40
The optimal threshold based on mean test and train f1-score is 0.33
fig. 30. The ROC curve for the model containing all the predictors is tolerable, but it is clear that the model will not be as accurate as the random forest one.
The model's performance is not bad; however, the small number of coefficients with low p-values indicates that the logistic regression is struggling. In particular, if we subset only the predictors with low p-values, the model does not function properly.
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.542805
Iterations: 35
Function evaluations: 37
Gradient evaluations: 37
The optimal threshold based on mean test and train f1-score is 0.67
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.605019
Iterations: 35
Function evaluations: 38
Gradient evaluations: 38
The optimal threshold based on mean test and train f1-score is 0.59
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.478963
Iterations: 35
Function evaluations: 38
Gradient evaluations: 38
The optimal threshold based on mean test and train f1-score is 0.46
fig. 33. The ROC curve of the model based on geographical predictors, crime duration and violence relation.
With these predictors, the model does manage to give good predictions.
Machine learning focused on other predictors
We have already built both random forest and logistic regression models using all the variables and then focusing on geo-related variables. Now we focus on time, violence relation and offense categories and follow the same sequence.
fig. 34. Although we have plenty of predictors, the most important ones turned out to be two of the three most common offense categories and the violence flag. Time-related predictors did not carry much importance.
The barplot above suggests dropping the time-related predictors.
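A hypothetical helper for this subsetting step is shown below; the column-name prefixes are assumptions, as the real dataset's time columns may be named differently.

```python
import pandas as pd

def drop_time_predictors(df, prefixes=("HOUR_", "MONTH_", "WEEKDAY_")):
    """Drop every dummified column whose name marks it as time-related."""
    keep = [c for c in df.columns if not c.startswith(prefixes)]
    return df[keep]

demo = pd.DataFrame({
    "HOUR_12": [1, 0],
    "OFFENSE_burglary": [0, 1],
    "IS_CRIME_VIOLENT": [1, 1],
})
print(list(drop_time_predictors(demo).columns))
# -> ['OFFENSE_burglary', 'IS_CRIME_VIOLENT']
```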
fig. 35. This model does not seem to be any better. Note especially the poor performance on the training set (which, unlike the test set, contains an equal number of industrial and non-industrial observations). Time therefore contributes considerably to the model's output.
However, refocusing the model on time-related predictors only does not give the desired result either.
Train Confusion Matrix :
[[616 104]
[133 587]]
Train Accuracy : 0.8354166666666667
Train f1-score : 0.8320340184266478
Train Recall : 0.8152777777777778
Train Precision : 0.849493487698987
Test Confusion Matrix :
[[ 15 42]
[ 61 182]]
Test Accuracy : 0.6566666666666666
Test f1-score : 0.7794432548179873
Test Recall : 0.7489711934156379
Test Precision : 0.8125
fig. 36. Of course, we have not dropped all the predictors that are not time-related, but we excluded the top three most important of them. The model did not crash, but its performance is far from good.
This is why we should tune the initial geo-free model, which includes both time-related and category-related predictors.
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.475193
Iterations: 35
Function evaluations: 37
Gradient evaluations: 37
The optimal threshold based on mean test and train f1-score is 0.4
fig. 42. The ROC curve of the model including all non-geo predictors except for the time-related ones.
This model is substantially worse than the very first one. Thus, the best logistic regression model that does not rely on geographical data should include both time-related and category-related variables.
Conclusion
Conclusion on q1:
From visualization: The overall map of offences is not distributed uniformly. For example, the south-east of the city appears to have less crime. However, for the rest of the crime classifications the map seems mostly uniform (all the breakdowns, by violence, by offense category and by MJ relation, look mixed up without any centres of attraction). To sum up, these questions cannot be answered by means of mere visualization.
From correlation:
Positive correlations: between business robberies and the Hale neighbourhood, aggravated assault and the 6th district, theft from motor vehicles and the 7th district, and the Speer neighbourhood and graffiti offences. This means these places are more vulnerable to these types of crimes. There is also a noticeable correlation between the distance to the city centre and the probability of burglary.
Negative correlations: no significant negative correlation involving any geographical variable (be it DISTRICT_ID or NEIGHBORHOOD_ID) is present. The only exception is the negative correlation between burglary and the 6th district (meaning a burglary is unlikely in the 6th district).
Conclusion on q2:
From visualization:
The crimes presented in the dataset are more often violent than not, and more often connected with industrial MJ-objects than not. This conclusion is largely determined by the predominance of burglaries, and burglary is both violent and industrial. The share of more aggravated crimes, including those involving violence against people, is below 10%.
However, this does not mean that MJ makes people more violent. It would be more accurate to say that after MJ was legalized, places with a high concentration of it, namely industrial MJ-growing sites, appeared. A person wanting to obtain a lot of MJ is most likely to burglarize such a site. Moreover, classifying burglaries as violent crimes is largely a convention and does not match the common meaning of the word 'violence'. Thus the predominance of violent crimes may be explained not by any violent disposition of Denver residents, nor by the impact of MJ on mental health, but by the way the MJ industry is organized and by criminological convention. The main outcome of this observation is the necessity of securing MJ industrial sites as the most likely crime locale.
One more notable fact is that only a few crimes in the dataset have no crime object other than drugs. In other words, after MJ was legalized, not many purely drug-related crimes connected with it were registered. This suggests that MJ consumers are relatively unconcerned with other hazardous substances.
Finally, we should note that the majority of crimes treat MJ as property (burglaries, larcenies, etc.). Therefore the typical MJ-criminal in Denver is not a deep-rooted drug addict but a person wishing to get their hands on MJ just like on any other asset. Hence MJ-crimes should mainly be combatted by means intended for combatting crimes against property, rather than by countermeasures against illegal drug circulation. For instance, reducing poverty and social inequality would be more useful than increasing MJ-specific control measures.
The correlation part of the report is not particularly valuable for answering the second question.
The machine learning part of the report reveals the high importance of information about the crime type and the time when it was committed for resolving the posed machine learning problem. However, no model where the crime type and its violence relation were the only predictors could achieve sufficient accuracy and f1-score. Consequently, a purely geographical portrait of the crime is more valuable than a purely qualitative one. Deep analysis of crime-type-related random forest splits, or of crime-type dummy coefficients in the logistic regression, would be misleading due to the extreme skewness of the crime type distribution (e.g., a node 'burglary or not' will be very informative for predicting not only MJ_RELATION_TYPE but many other target variables as well, which merely reflects the number of burglaries in the dataset and their criminological attributes, not the actual features of the MJ-crime population).
Conclusion on q3:
The posed machine learning problem was resolved by means of random forest and logistic regression. Both types of models proved suitable for predicting the target variable MJ_RELATION_TYPE.
The main limitation of all the machine learning performed (as well as of the machine learning interpretation performed for questions 1 and 2) was its focus on the MJ_RELATION_TYPE variable. In other words, some insights might be obtained from machine learning aimed at other target variables. However, within the framework of this project a full analysis was infeasible, and the importance and statistical significance of the predictors were assessed only for predicting the industrial or non-industrial nature of the offense.
Nevertheless, the nature of the task and the domain of the research require the model to be interpretable. As the optimal number of trees in the random forest models was over 20 in all cases, it was impossible to visualize the trees, so the only available visualization was the feature importance ranking. However, the ranking itself is not informative here: we are more interested in the split thresholds, which we cannot visualize given the number of trees. This is why the logistic regression model fits the task better.
The main feature of the logistic regression analysis was the constant limitation of the number of predictors. Due to the dummification of categorical variables there were over 100 predictors, so the initial logistic regression model malfunctioned: few p-values were low enough to explore further, and the coefficients could be misleading. After predictor subsetting was performed, the logistic regression model improved its performance and yielded many significant, interpretable coefficients. For instance, the burglary dummy variable had a strongly positive coefficient (being a burglary increases the chances of the offense being industrial), while the street robbery dummy coefficient was among the most negative (as no street robbery is industrial).
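This reading of the coefficient signs can be illustrated on synthetic dummy data constructed so that a burglary flag pushes towards the industrial class and a street-robbery flag away from it; the construction mirrors the reported signs but does not reproduce the real fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 1000
burglary = rng.integers(0, 2, n)
# Street robberies only occur among non-burglary offenses here.
street_robbery = (1 - burglary) * rng.integers(0, 2, n)
# Industrial target: ~90% likely for burglaries, ~30% baseline otherwise,
# and never for street robberies.
base = rng.random(n) < 0.3
y = np.where(burglary == 1, rng.random(n) < 0.9, base).astype(int)
y[street_robbery == 1] = 0

X = np.column_stack([burglary, street_robbery])
model = LogisticRegression().fit(X, y)
coef = dict(zip(["BURGLARY", "STREET_ROBBERY"], model.coef_[0]))
print(coef)  # BURGLARY positive, STREET_ROBBERY negative
```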
However, machine learning was not very insightful in terms of recommendations for the police beyond those highlighted for questions 1 and 2. The general output of the models is that a crime is a very multifaceted phenomenon whose features cannot be predicted with sufficient accuracy (in the common meaning of the term) from predictors reflecting only the time when the crime was committed, only its location, or only its type. The more predictors related to different aspects of the crime are at hand, the more successful the model. In this respect we may recommend that the Denver police department broaden the data it collects. For example, adding the number of criminals involved, or at least a flag for a crime committed by a group or organized gang, would be very beneficial.